Duke University Libraries
Center for Data and Visualization Sciences
February 4, 2026
01. What is polars?
02. Why polars?
03. How do I import data?
04. How can I explore my data?
05. How can I transform my data?
“Come for the speed, stay for the API”
- Janssens and Nieuwdorp, 2025
The read_csv() function creates a Polars DataFrame from a CSV file.
| flight | cost | distance_km | non_stop |
|---|---|---|---|
| str | f64 | i64 | bool |
| "Raleigh Durham to Svalbard" | null | 6209 | false |
| "Raleigh Durham to London" | 1200.0 | 6216 | true |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true |
| "Raleigh Durham to Boston" | 250.0 | 985 | true |
| "Raleigh Durham to Chicago" | 300.0 | 1040 | false |
| "Raleigh Durham to New York" | 420.0 | 693 | true |
The write_csv() function writes a DataFrame to a CSV file to preserve the data.
01. Evaluating data size
Polars includes the shape of the DataFrame in its default output.
shape: (6, 4)
| flight | cost | distance_km | non_stop |
|---|---|---|---|
| str | f64 | i64 | bool |
| "Raleigh Durham to Svalbard" | null | 6209 | false |
| "Raleigh Durham to London" | 1200.0 | 6216 | true |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true |
| "Raleigh Durham to Boston" | 250.0 | 985 | true |
| "Raleigh Durham to Chicago" | 300.0 | 1040 | false |
| "Raleigh Durham to New York" | 420.0 | 693 | true |
You can also check the properties of a DataFrame programmatically:
02. Browsing
head() lists the first five rows of data
| flight | cost | distance_km | non_stop |
|---|---|---|---|
| str | f64 | i64 | bool |
| "Raleigh Durham to Svalbard" | null | 6209 | false |
| "Raleigh Durham to London" | 1200.0 | 6216 | true |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true |
| "Raleigh Durham to Boston" | 250.0 | 985 | true |
| "Raleigh Durham to Chicago" | 300.0 | 1040 | false |
tail() lists the final five rows of data
| flight | cost | distance_km | non_stop |
|---|---|---|---|
| str | f64 | i64 | bool |
| "Raleigh Durham to London" | 1200.0 | 6216 | true |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true |
| "Raleigh Durham to Boston" | 250.0 | 985 | true |
| "Raleigh Durham to Chicago" | 300.0 | 1040 | false |
| "Raleigh Durham to New York" | 420.0 | 693 | true |
By default, Polars prints the first and last five rows of a DataFrame.
If you want to see more rows:
03. Understanding columns
columns lists the column names
schema lists column names and data types
04. Considering descriptive statistics
Often, we want to assess the associated statistics for each column.
describe() provides descriptive statistics for DataFrames
| statistic | flight | cost | distance_km | non_stop |
|---|---|---|---|---|
| str | str | f64 | f64 | f64 |
| "count" | "6" | 5.0 | 6.0 | 6.0 |
| "null_count" | "0" | 1.0 | 0.0 | 0.0 |
| "mean" | null | 554.0 | 3123.0 | 0.666667 |
| "std" | null | 385.460763 | 2612.571989 | null |
| "min" | "Raleigh Durham to Boston" | 250.0 | 693.0 | 0.0 |
| "25%" | null | 300.0 | 985.0 | null |
| "50%" | null | 420.0 | 3595.0 | null |
| "75%" | null | 600.0 | 6209.0 | null |
| "max" | "Raleigh Durham to Svalbard" | 1200.0 | 6216.0 | 1.0 |
05. Handling missing data
Polars represents missing values as null.
| Task | polars “verb” |
|---|---|
| Subset columns | select() |
| Subset rows | filter() |
| Sort | sort() |
| Create a new variable | with_columns() |
| Aggregate by groups | group_by() |
select() chooses columns
The flights DataFrame has four columns.
| column | description |
|---|---|
| flight | destination of RDU flight |
| cost | cost of flight |
| distance_km | distance in km |
| non_stop | is flight non-stop |
Let’s select flight and cost.
Note that the columns are returned in the order that I request them.
Often, removing columns with the drop() method is more efficient than listing the remaining columns in select().
| flight | distance_km | non_stop |
|---|---|---|
| str | i64 | bool |
| "Raleigh Durham to Svalbard" | 6209 | false |
| "Raleigh Durham to London" | 6216 | true |
| "Raleigh Durham to Los Angeles" | 3595 | true |
| "Raleigh Durham to Boston" | 985 | true |
| "Raleigh Durham to Chicago" | 1040 | false |
| "Raleigh Durham to New York" | 693 | true |
select() reduces the DataFrame to the essential columns and improves performance.
Polars offers numerous options for more precise selections including:
- flights.select(pl.col('^c.*$')) # columns starting with “c”
- flights.select(pl.col(pl.Int64)) # columns with the Int64 data type
- flights[:]
The filter command identifies the rows that you wish to include in a DataFrame.
Let’s say that we wanted only the flights from RDU that were shorter than 1000 kilometers.
So, our Boston and New York flights are both under 1000 kilometers from Durham.
Polars also has convenient tools for finding text:
Polars allows complex filter queries by combining multiple conditions using boolean expressions. For most operations, you will use the following operators to build these expressions.
| Operator | Symbol |
|---|---|
| AND | & |
| OR | \| |
| NOT | ~ |
Let’s say that we want to see all the flights that are:
Final thoughts on filter-ing:
sort() allows us to specify how the rows in the DataFrame should be ordered.
Let’s try a simple sort based on our filtered data.
| flight | cost | distance_km | non_stop |
|---|---|---|---|
| str | f64 | i64 | bool |
| "Raleigh Durham to New York" | 420.0 | 693 | true |
| "Raleigh Durham to Boston" | 250.0 | 985 | true |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true |
| "Raleigh Durham to London" | 1200.0 | 6216 | true |
In the above, note how sort defaults to ascending order.
| flight | cost | distance_km | non_stop |
|---|---|---|---|
| str | f64 | i64 | bool |
| "Raleigh Durham to London" | 1200.0 | 6216 | true |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true |
| "Raleigh Durham to Boston" | 250.0 | 985 | true |
| "Raleigh Durham to New York" | 420.0 | 693 | true |
It is entirely possible to specify more than one column for a sort.
- Let’s look at our non-stop flights sorted by cost and distance.
| flight | cost | distance_km | non_stop |
|---|---|---|---|
| str | f64 | i64 | bool |
| "Raleigh Durham to London" | 1200.0 | 6216 | true |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true |
| "Raleigh Durham to New York" | 420.0 | 693 | true |
| "Raleigh Durham to Boston" | 250.0 | 985 | true |
If you have missing data in a sort column, it will be listed first.
You can list null values last by adding the nulls_last=True parameter:
| flight | cost | distance_km | non_stop |
|---|---|---|---|
| str | f64 | i64 | bool |
| "Raleigh Durham to London" | 1200.0 | 6216 | true |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true |
| "Raleigh Durham to New York" | 420.0 | 693 | true |
| "Raleigh Durham to Chicago" | 300.0 | 1040 | false |
| "Raleigh Durham to Boston" | 250.0 | 985 | true |
| "Raleigh Durham to Svalbard" | null | 6209 | false |
with_columns() provides a convenient way to add columns to DataFrames.
| flight | cost | distance_km | non_stop | cost_per_kilometer |
|---|---|---|---|---|
| str | f64 | i64 | bool | f64 |
| "Raleigh Durham to Svalbard" | null | 6209 | false | null |
| "Raleigh Durham to London" | 1200.0 | 6216 | true | 0.19305 |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true | 0.166898 |
| "Raleigh Durham to Boston" | 250.0 | 985 | true | 0.253807 |
| "Raleigh Durham to Chicago" | 300.0 | 1040 | false | 0.288462 |
| "Raleigh Durham to New York" | 420.0 | 693 | true | 0.606061 |
Polars only makes changes to DataFrames permanent when you save the change, for example by assigning the result back to a variable.
| flight | cost | distance_km | non_stop |
|---|---|---|---|
| str | f64 | i64 | bool |
| "Raleigh Durham to Svalbard" | null | 6209 | false |
| "Raleigh Durham to London" | 1200.0 | 6216 | true |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true |
| "Raleigh Durham to Boston" | 250.0 | 985 | true |
| "Raleigh Durham to Chicago" | 300.0 | 1040 | false |
| "Raleigh Durham to New York" | 420.0 | 693 | true |
Let’s add cost_per_kilometer permanently to our DataFrame and then check flights:
| flight | cost | distance_km | non_stop | cost_per_kilometer |
|---|---|---|---|---|
| str | f64 | i64 | bool | f64 |
| "Raleigh Durham to Svalbard" | null | 6209 | false | null |
| "Raleigh Durham to London" | 1200.0 | 6216 | true | 0.19305 |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true | 0.166898 |
| "Raleigh Durham to Boston" | 250.0 | 985 | true | 0.253807 |
| "Raleigh Durham to Chicago" | 300.0 | 1040 | false | 0.288462 |
| "Raleigh Durham to New York" | 420.0 | 693 | true | 0.606061 |
Polars offers a second syntax style for creating columns.
This style names the new column with the .alias() method.
| flight | cost | distance_km | non_stop | cost_per_kilometer |
|---|---|---|---|---|
| str | f64 | i64 | bool | f64 |
| "Raleigh Durham to Svalbard" | null | 6209 | false | null |
| "Raleigh Durham to London" | 1200.0 | 6216 | true | 0.19305 |
| "Raleigh Durham to Los Angeles" | 600.0 | 3595 | true | 0.166898 |
| "Raleigh Durham to Boston" | 250.0 | 985 | true | 0.253807 |
| "Raleigh Durham to Chicago" | 300.0 | 1040 | false | 0.288462 |
| "Raleigh Durham to New York" | 420.0 | 693 | true | 0.606061 |
group_by() groups rows defined by one or more variables.
group_by() requires additional code to use the groupings!
How many flights are non-stop (true) and “multi-stop” (false)?
group_by() combined with len() (length) will count the rows in each group!
If you have multiple aggregations that you wish to perform on each group of data, agg() (or aggregate) allows one or more calculations per grouping.
Use agg() after a group_by().
I know…
I said I was only going to show you five common “verbs” in Polars…
group_by() provides a convenient way to “collapse” data into groups…
When you want to keep every row instead, the over() method comes in handy.
Let’s calculate the average distance over non-stop and multi-stop flights.
- We will insert the calculation for each group in the DataFrame.
| flight | non_stop | average_distance_km |
|---|---|---|
| str | bool | f64 |
| "Raleigh Durham to Svalbard" | false | 3624.5 |
| "Raleigh Durham to London" | true | 2872.25 |
| "Raleigh Durham to Los Angeles" | true | 2872.25 |
| "Raleigh Durham to Boston" | true | 2872.25 |
| "Raleigh Durham to Chicago" | false | 3624.5 |
| "Raleigh Durham to New York" | true | 2872.25 |
If you are using VS Code or another Python IDE:
import polars as pl
rdu_flights = pl.read_parquet("rdu_flights.parquet")
If you are using Google Colab:
import polars as pl
from google.colab import files
files.upload() # choose rdu_flights.parquet
rdu_flights = pl.read_parquet("rdu_flights.parquet")
NOTE - you will need to specify the path to the data
rdu_flights.shape
(954988, 45)
rdu_flights.height and rdu_flights.width will also give you the answer
rdu_flights.columns
rdu_flights.schema
What is the number of departing flights at RDU airport by year? (Bonus: sort by year)
rdu_flights.group_by('year').len()
rdu_flights.group_by('year').agg(pl.len().alias('flights'))
| year | flights |
|---|---|
| i64 | u32 |
| 2010 | 48924 |
| 2011 | 42620 |
| 2012 | 44016 |
| 2013 | 47983 |
| 2014 | 38813 |
| 2015 | 34898 |
| 2016 | 104895 |
| 2017 | 71874 |
| 2018 | 121448 |
| 2019 | 128312 |
| 2020 | 67720 |
| 2021 | 44136 |
| 2022 | 54049 |
| 2023 | 59416 |
| 2024 | 45884 |
Are departure delays increasing or decreasing at RDU each year?
| year | delayed_flights |
|---|---|
| i64 | u32 |
| 2010 | 7339 |
| 2011 | 6234 |
| 2012 | 6099 |
| 2013 | 8003 |
| 2014 | 6757 |
| 2015 | 5623 |
| 2016 | 16182 |
| 2017 | 11952 |
| 2018 | 23592 |
| 2019 | 23146 |
| 2020 | 5082 |
| 2021 | 5674 |
| 2022 | 9896 |
| 2023 | 10762 |
| 2024 | 9260 |
CDVS - Duke Libraries - askdata@duke.edu
As always, the Duke Libraries Center for Data and Visualization Sciences (askdata@duke.edu) can assist with questions about data management and data wrangling. Consultations are available by appointment.
Polars API
The Polars Python API is an outstanding resource for the latest syntax. If you are questioning the validity of AI suggestions (and sometimes those suggestions are erroneous!), the API can help resolve questions.
Duke Libraries subscribe to the O’Reilly for Higher Education Database where you will find Jeroen Janssens and Thijs Nieuwdorp’s Python Polars: The Definitive Guide. This is an excellent way to learn Polars and serves as a compelling reference.
I used FAA flight data pulled using the anyflights API. If you would like to investigate this API further check out:
Couch S (2023). anyflights: Query ‘nycflights13’- Air Travel Data for Given Years and Airports. Code: https://github.com/simonpcouch/anyflights, Context: https://simonpcouch.github.io/anyflights/.
As mentioned in the introduction, Polars is written in Rust and has implementations in Julia, R (check out TidyPolars), and JavaScript. If Polars seems compelling and you regularly use one of those languages, please try one of the other implementations!